PINS: A Perturbation Clustering Approach for Data Integration and Disease Subtyping

ABSTRACT

Disease subtyping is accomplished by a computer-implemented algorithm that manipulates a first genetic dataset to construct a set of first connectivity matrices. To this set of matrices Gaussian noise is introduced to generate a perturbed dataset. The computer-implemented algorithm assesses which of the set of first connectivity matrices was least affected by introduction of noise and that matric is used to define the optimal clustering. Once the optimal clustering is determined, computer-implemented supervised classification is performed to determine, for a particular patient, with which disease subtype cluster that person&#39;s genetic data most closely aligns. Armed with this knowledge, the treatment regimen is specified with much higher likelihood of success.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a divisional of U.S. application Ser. No. 15/068,048filed on Mar. 11, 2016, which claims the benefit of U.S. ProvisionalApplication No. 62/132,263 filed on Mar. 12, 2015 and U.S. ProvisionalApplication No. 62/221,727 filed on Sep. 22, 2015. The entire disclosureof each of the above applications is incorporated herein by reference.

FIELD

The present disclosure relates to disease diagnosis using geneticanalysis.

BACKGROUND

The advent of high-throughput genomics technologies has resulted inmassive amounts of diverse genome-scale data. Gene expression data,measured by microarrays or next generation sequencing platforms, are themost prevalent data type available for biological data analysis. GeneExpression Omnibus stores thousands of datasets with independentexperimental series of similar patient cohorts and experiment design. Astechnologies advance, other data types become available and togetherthey offer complementary information on the same disease or biologicalphenomenon. The Cancer Genome Atlas (TCGA) has already gathered genome,transcriptome, and epigenome information for over 20 cancers forthousands of patients. The challenge is to interpret the massive amountsof high-dimensional and heterogeneous data types to gain insights intobiological processes.

Disease subtyping is often the first step to better understand a diseaseor biological phenomenon. The goal is to detect unknown groups ofpatients based on intrinsic features without external information. Thedisease subtyping problem includes the following fundamental issues: 1)how to determine the number of clusters and assign patients to eachgroup, 2) how to combine complementary information to determine thefinal partitioning. The former problem often involves clustering mRNAexpression where the data has small sample size but very high dimension.This is still an important problem since gene expression is one of themost prevalent data type available. The latter problem includesintegration of multi-omics data, such as mRNA expression, DNAmethylation, and miRNA, for class discovery. With the rapidly advancingtechnologies, more and more data types are available for the same set ofpatients, making the increasing need for combining multi-omics data.

In functional genomics, agglomerative hierarchical clustering (HC) is afrequently used approach for clustering genes or samples that showsimilar expression patterns. HC pro-vides for a structural view of thedata that makes it appealing in exploratory data analysis. However,classical HC imposes a tree structure on the data that might not reflectthe underlying structure, and is highly sensitive to the metric used toassess similarity among elements. Divisive clustering methods, such ask-means, global k-means, fuzzy modification of k-means, have beenapplied for the same application. These methods provide clear clusterboundaries and tighter clusters, but they lack the visual appeal of HC.Another group of methods are neural network clustering, such asself-organizing maps (SOM), Self-Organizing Tree Algorithm (SOTA), andDynamically Growing Self-Organizing Tree (DGSOT). Neural networks can bemodeled as a collection of nodes with weighted interconnections, whichcan be adaptively learned. The common drawbacks of both k-means basedmethods and neural networks based methods is the need to specify thenumber of clusters beforehand.

Resampling-based methods have been proposed to determine the number ofclusters. They assess the stability of the clustering results withrespect to resampling variability. Arguably the state-of-the-artapproach in this area is Consensus Clustering (CC). It develops ageneral, model independent resampling-based methodology of classdiscovery, cluster validation, and visualization. CC calculates thepair-wise similarities (frequency of how often the elements are groupedtogether) and their empirical cumulative distribution function (CDF)using sub-sampling. The pair-wise similarities are then used forvisualization and for estimating the cluster number. This approach hasbeen widely used and gained laudable results. The main assumption of CCis that if the samples were drawn from K distinct sub-populations thattruly exist, different sub-samples would show the greatest level ofstability at the true K. Unfortunately, this makes CC claim apparentstructure when there is none, or declare cluster stability when thestability is subtle.

The goal of an integrative analysis is to identify subgroups of samplesthat are similar not only at one level (e.g., mRNA), but from a holisticperspective, that can take into consideration phenomena at various otherlevels (e.g., DNA methylation, miRNA, etc.). One strategy is to analyzeeach data type independently before combining them. One of the drawbacksof this approach is that it might lead to inconsistent conclusions thatare hard to integrate. Another approach is to use machine learningtechniques. However, these methods are not scalable to the full spectrumof measurements, making them sensitive to gene selection step. Onerecent approach, Similarity Network Fusion (SNF), creates a network ofpatients for each data type before fusing the network using a metricfusion technique developed for image processing applications. The fusednetwork is then partitioned using spectral clustering. The unstablenature of the spectral clustering and the metric fusion technique makesthe method sensitive to its parameters. In addition, this method is notdesigned to solve the clustering when only one data type is available.

SUMMARY

Here we present a new approach to address both of the mentioned issues.Our framework is divided into two stages. In the first stage, we solvethe classical clustering problem given a single data type. Althoughseveral specific high-dimensionality data types are illustrated in ourexamples here, our technology is general enough to be applicable for anyhigh-dimensional genetic or life science data. The second stage combinesthe partitionings of individual data types to determine the finalpartitioning. In our experimental study, we evaluate the first stage byclustering 8 gene expression datasets of different diseases. For all the8 datasets, PINS outperforms its competitors in recovering the trueclasses. To evaluate the second stage, we downloaded mRNA, DNAmethylation, and miRNA data of 6 difference cancers from TCGA: kidneyrenal clear cell carcinoma (KIRC), glioblastoma (GBM), lung squamouscell carcinoma (LUSC), breast invasive carcinoma (BRCA), acute myeloidleukemia (LAML), and colon adeno-carcinoma (COAD) with survival andclinical data. PINS substantially outperforms other methods inidentifying subtypes and in predicting survival using the multi-omicsdata.

BRIEF DESCRIPTION OF THE DRAWINGS

FIGS. 1A-1B (collectively referred to as FIG. 1) are flowchart diagramsdepicting the Stage I process for perturbation clustering for highdimensional data;

FIGS. 2A-2D (collectively referred to as FIG. 2) depicts a firstexemplary dataset (Gaussian1-1 class) associated with the perturbationclustering process of FIG. 1, specifically:

FIG. 2A—also referred to as FIG. 2, Panel (A)—is a graphical plot of thegene expression profile of the exemplary dataset;

FIG. 2B—also referred to as FIG. 2, Panel (B)—is a series ofconnectivity matrices, comparing original data vs perturbed data fordifferent values of k;

FIG. 2C—also referred to as FIG. 2, Panel (C)—is a graph of theempirical cumulative distribution functions (CDF) of the differencematrix Dk;

FIG. 2D—also referred to as FIG. 2, Panel (D)—is a graph of the areaunder the curve (AUC) of the CDFs for each value of k;

FIGS. 3A-3D (collectively referred to as FIG. 3) depicts a secondexemplary dataset (Gaussian2-2 classes) associated with the perturbationclustering process of FIG. 1, the individual FIG. 3 Panels (A)-(D)displaying comparable information as described in connection with FIG.2;

FIGS. 4A-4D (collectively referred to as FIG. 4) depicts a secondexemplary dataset (Gaussian3-3 classes) associated with the perturbationclustering process of

FIG. 1, the individual FIG. 4 Panels (A)-(D) displaying comparableinformation as described in connection with FIG. 2;

FIGS. 5A-5D (collectively referred to as FIG. 5) depicts a secondexemplary dataset (Gaussian5-5 classes) associated with the perturbationclustering process of FIG. 1, the individual FIG. 5 Panels (A)-(D)displaying comparable information as described in connection with FIG.2;

FIGS. 6A-6D (collectively referred to as FIG. 6) depicts a secondexemplary dataset (Gaussian9-9 classes) associated with the perturbationclustering process of FIG. 1, the individual FIG. 6 Panels (A)-(D)displaying comparable information as described in connection with FIG.2;

FIG. 7 is an area under the curve (AUC) graph for ten (10) simulateddatasets;

FIGS. 8A-8D (collectively referred to as FIG. 8) depict data for thelung cancer dataset GSE19188, useful in understanding the disclosedclustering technique, specifically:

FIG. 8A—also referred to as FIG. 8, Panel (A)—is a set of connectivitymatrices, comparing original data vs perturbed data for k=3 and k=6;

FIG. 8B—also referred to as FIG. 8, Panel (B)—depicts the cumulativedistribution functions (CDF) for different values of k;

FIG. 8C—also referred to as FIG. 8, Panel (C)—depicts the area under thecurve (AUC) graph for dataset GSE19188 and dataset Gaussian1;

FIG. 8D—also referred to as FIG. 8, Panel (D)—depicts the clusteringresult in the first two principal components, where the circlesrepresent the LCC samples; the triangles represent the ADC samples; thecrosses represent the SCC samples;

FIG. 9 is a data flow diagram illustrating data integration and diseasesubtyping for a kidney renal clear cell carcinoma (KIRC);

FIGS. 10A-10C (collectively referred to as FIG. 10) are cluster diagramsfor dataset GSE10245, comparing the presently disclosed PINS technique(FIG. 10A) with the SNF technique (FIG. 10B) and the CC technique (FIG.9C);

FIGS. 11A-11C (collectively referred to as FIG. 11) are cluster diagramsfor dataset GSE19188, comparing the presently disclosed PINS technique(FIG. 11A) with the SNF technique (FIG. 11B) and the CC technique (FIG.11C);

FIGS. 12A-12C (collectively referred to as FIG. 12) are cluster diagramsfor dataset GSE43580, comparing the presently disclosed PINS technique(FIG. 12A) with the SNF technique (FIG. 12B) and the CC technique (FIG.12C);

FIGS. 13A-13C (collectively referred to as FIG. 13) are cluster diagramsfor dataset GSE14924, comparing the presently disclosed PINS technique(FIG. 13A) with the SNF technique (FIG. 13B) and the CC technique (FIG.13C);

FIGS. 14A-14C (collectively referred to as FIG. 14) are cluster diagramsfor dataset GSE15061, comparing the presently disclosed PINS technique(FIG. 14A) with the SNF technique (FIG. 14B) and the CC technique (FIG.14C);

FIGS. 15A-15C (collectively referred to as FIG. 15) are cluster diagramsfor dataset AML2004, comparing the presently disclosed PINS technique(FIG. 15A) with the SNF technique (FIG. 15B) and the CC technique (FIG.15C);

FIGS. 16A-16C (collectively referred to as FIG. 16) are cluster diagramsfor dataset Lung2001, comparing the presently disclosed PINS technique(FIG. 16A) with the SNF technique (FIG. 16B) and the CC technique (FIG.16C);

FIGS. 17A-17C (collectively referred to as FIG. 17) are cluster diagramsfor dataset Brain2002, comparing the presently disclosed PINS technique(FIG. 17A) with the SNF technique (FIG. 17B) and the CC technique (FIG.17C);

FIG. 18A-18B (collectively referred to as FIG. 18) are Kaplan-Meiersurvival analysis graphs for kidney renal clear cell carcinoma (KIRC),comparing the presently disclosed PINS technique (FIG. 18A) with the SNFtechnique (FIG. 18B);

FIG. 19A-19B (collectively referred to as FIG. 19) are Kaplan-Meiersurvival analysis graphs for glioblastoma multiform (GMB), comparing thepresently disclosed PINS technique (FIG. 19A) with the SNF technique(FIG. 19B);

FIG. 20A-20B (collectively referred to as FIG. 20) are Kaplan-Meiersurvival analysis graphs for lung squamous cell carcinoma (LUSC),comparing the presently disclosed PINS technique (FIG. 20A) with the SNFtechnique (FIG. 20B);

FIG. 21A-21B (collectively referred to as FIG. 21) are Kaplan-Meiersurvival analysis graphs for breast invasive carcinoma (BRCA), comparingthe presently disclosed PINS technique (FIG. 21A) with the SNF technique(FIG. 21B);

FIG. 22A-22B (collectively referred to as FIG. 22) are Kaplan-Meiersurvival analysis graphs for acute myeloid leukemia (LAML), comparingthe presently disclosed PINS technique (FIG. 22A) with the SNF technique(FIG. 22B);

FIG. 23A-23B (collectively referred to as FIG. 23) are Kaplan-Meiersurvival analysis graphs for colon adenocarcinoma (COAD), comparing thepresently disclosed PINS technique (FIG. 23A) with the SNF technique(FIG. 23B);

FIG. 24A-24B (collectively referred to as FIG. 24) are Kaplan-Meiersurvival analysis graphs for glioblastoma multiform (GMB) phase 1 (FIG.24A) and phase 2 (FIG. 24B), applying the presently disclosed PINStechnique;

FIG. 25 is a heatmap of features differentials among glioblastomamultiform (GMB) subtypes, comparing three data types: mRNA, DNAmethylation, and miRNA;

FIG. 26 is a chart showing age distribution of the discovered subtypesfor glioblastoma multiform (GMB);

FIG. 27A-27B (collectively referred to as FIG. 27) are Kaplan-Meiersurvival analysis graphs for kidney renal clear cell carcinoma (KIRC)phase 1 (FIG. 27A) and phase 2 (FIG. 27B), applying the presentlydisclosed PINS technique;

FIG. 28 is a heatmap of features differentials among kidney renal clearcell carcinoma (KIRC) subtypes, comparing three data types: mRNA, DNAmethylation, and miRNA;

FIG. 29 is a chart showing age distribution of the discovered subtypesfor kidney renal clear cell carcinoma (KIRC)

FIG. 30 is a simplified flow diagram depicting the major steps of thedisclosed PINS technique;

FIGS. 31a and 31b are simplified flow diagrams depicting the major stepsof the disclosed technique for subtyping multi-omics data;

FIG. 32 is a computer system diagram illustrating one combined hardwareand software embodiment of implementing the disclosed technique.

DESCRIPTION OF PREFERRED EMBODIMENTS

In this disclosure, we present a new technological approach to addressboth of the mentioned issues. We refer to our new technological approachby the acronym PINS (Perturbation clustering approach for dataINtegration and disease Subtyping). Our technology is divided into twostages. In the first stage, we solve in a new way the classicalclustering problem given a single data type. While particularly wellsuited to analyzing genetic data, our approach is general enough to beapplicable for any high-dimensional data. The second stage combines thepartitionings of individual data types to determine the finalpartitioning.

In our experimental study described here, we evaluate the first stage byclustering 8 gene expression datasets of different diseases. For all the8 datasets, PINS outperforms its competitors in recovering the trueclasses. To evaluate the second stage, we downloaded mRNA, DNAmethylation, and miRNA data of 6 difference cancers from TCGA: kidneyrenal clear cell carcinoma (KIRC), glioblastoma (GBM), lung squamouscell carcinoma (LUSC), breast invasive carcinoma (BRCA), acute myeloidleukemia (LAML), and colon adenocarcinoma (COAD) with survival andclinical data. PINS substantially outperforms other methods inidentifying subtypes and in predicting survival using the multi-omicsdata.

II. METHODS

Here we describe the perturbation clustering for a single data type anddata integration for multiple data type. This section is organized asfollows. Section II-A describes the perturbation clustering (stage I),in which the patients are partitioned using one data type. This firststage outputs the clustering and the pair-wise connectivity of thepatients. Section II-B describes the data integration and diseasesubtyping (stage II) using multi-omics data. This second stage consistsof 2 phases. In phase 1, the pair-wise connectivity (between patients)for multiple data types are combined to form the network betweenpatients. This network is then partitioned to determine the groupingusing the integrated data. In phase 2, we further split each group intosub-groups if possible. The output of the phase 2 is then reported asthe output of PINS using the multi-omics data.

A. Perturbation Clustering (Stage I)

In stage I of PINS, we solve the classical clustering problem, i.e., wefocus on clustering samples (patients) using one data type. The approachis based on the observation that small changes in any kind ofquantitative assay will be inherently present between individuals, evenin a truly homogeneous population in the absence of any subtypes. Here,the hypothesis is that if well-defined subtypes of disease do exist,these have to be stable with respect to small changes in the measuredvalues. Hence, we are not interested in any clusters that form ordisappear due to small changes in the data, but rather we are lookingfor those groupings that remain stable across many clusterings built inthe presence of small changes. In order to find such clusters, we addGaussian noise to the data and reconstruct the clustering many times.The stability is assessed by the discrepancies in the clustering resultsbetween the original and the perturbed data. Based on this, we extractthe “true” number of clusters as being the one that is least affected bysuch perturbations. In the absence of any true subtypes, the repeatedclusterings will show lack of stability thus allowing us to avoid thediscovery of false subtypes.

The framework will be described here using k-means clustering as thebasic building block of our subtype discovery approach, but a number ofother classical clustering approaches could be used instead. It iswell-known that the k-means algorithm may converge to a local minimumdepending on the initialization. To overcome this problem, we use the“modified version” of k-means, i.e., we run k-means many times withdifferent random initialization and then choose the result that givesthe least residual sum of squares (RSS). In the rest of this manuscript,the term k-means refers to the “modified version” of k-means.

The high-level algorithm can be briefly described as follows: i) For agiven number of clusters k, we cluster the original data using k-meansand then construct the connectivity between patients (originalconnectivity). ii) We add noise to the data and re-cluster the perturbeddata many times to determine the average connectivity between patientswhen the data are perturbed (perturbed connectivity). iii) We calculatethe discrepancy between the original clustering and the perturbedclustering for each k. iv) We repeat the above steps for all values ofkin a range of interest (e.g., 2..10). v) We choose the k which givesthe least discrepancy between the original and perturbed connectivity.The corresponding clustering is then returned as the most stable one.

Our approach differs from the existing methods in the following aspects:i) we accept the noisy nature of the biological measurement and use thedata as they are (without data pre-processing) and therefore do notsuffer from information loss and do not require a preliminary featureselection, nor a dimensionality reduction; ii) our stability metric isexpected to deterministically and reliably identify the number ofclusters present in the data.

FIG. 1 shows the detailed workflow of the perturbation clusteringalgorithm (stage I). The input of the algorithm is a dataset (matrix)I∈R_(N×M), where N is the number of patients and M is the number ofmeasurements for each patient. In the example of gene expression, N isthe number of samples and M is the number of genes (or probes) measuredin each sample. In short, the rows of the matrix I represents thepatients and the columns represents the components (features). Thealgorithm parameters are K (default 10) and H (default 200) where K isthe maximum number of clusters and H is the number of perturbation H.The algorithm consists of 11 steps, which will be described step-by-stepin the following sections.

1. Construction of Original Connectivity Matrices (Steps 1-2)

In step (1), we partition the patients using all possible number ofclusters k=[2.. K]. Formally, the input I can be presented as a set of Npatients I={e₁, e₂, . . . , e_(N)} where each element e_(i) is in aM^(th) dimensional space and represents the molecular profile of thei^(th) patient (i∈[1..N]). A partitioning P_(k) (k clusters) of I can bewritten in the form P_(k)={P₁, P₂, . . . , P_(k)} where P_(i) is a setof patients, such that ∪_(i=1) ^(k) P_(i)=I and P_(i)∩P_(j)=Ø, ∀i,j∈[1..k], i≠j. After step (1), we have (K−1) partitionings: {P₂, . . . ,P_(K)}, one for each value of k∈[2..K].

In step (2), we build the pair-wise connectivity between the patientsusing the partitionings obtained from step (1). For a partitioningP_(k), two patients are connected if they are clustered together. Webuild the connectivity matrix C_(k)∈{0,1}^(N×N) from the partitioningP_(k)={P₁, P₂, , P_(k)} as follows:

$\begin{matrix}{{C_{k}\left( {i,j} \right)} = \left\{ \begin{matrix}1 & {{{{if}\mspace{14mu} {\exists{t \in \left\lbrack {1\mspace{14mu} \ldots \mspace{14mu} k} \right\rbrack}}}:1},{j \in P_{t}}} \\0 & {otherwise}\end{matrix} \right.} & (1)\end{matrix}$

In other words, the connectivity between two patients is 1 if and onlyif they belong to the same cluster. Let us consider one example. Wecluster a set of 5 elements into 2 clusters with the resultedpartitioning P₂={{1, 2},{3, 4, 5}}. In this case, element 1 is connectedto element 2 and is not connected to other elements {3, 4, and 5}.Similarly, elements {3, 4, 5} are all connected to each other, but notto elements {1, 2}. Using equation (1), we construct the connectivitymatrix for P₂ is as follows:

$C_{2} = \begin{bmatrix}1 & 1 & 0 & 0 & 0 \\1 & 1 & 0 & 0 & 0 \\0 & 0 & 1 & 1 & 1 \\0 & 0 & 1 & 1 & 1 \\0 & 0 & 1 & 1 & 1\end{bmatrix}$

Intuitively, a partitioning can be presented as a graph in which eachpatient is a node and the connectivity between two patients is an edge,such that the edge exists if and only if the two patients have similargene expression profile and thus are clustered together. Any twopatients of a cluster are connected by an edge, and any two patients ofdifferent clusters are not connected. The connectivity matrix of isexactly the adjacency matrix of the graph.

We construct one connectivity matrix for each value of k∈[2..K]. Afterstep (2), we have (K−1) connectivity matrices C₂, . . . , C_(K). Werefer to these matrices as original connectivity matrices because theyare constructed from the original data without perturbation orresampling.

2. Generating Perturbed Datasets (Steps 3-4)

In order to assess the stability of the partitionings obtained in steps(1-2), we generate H new datasets by adding Gaussian noise to theoriginal data I. In step (3), we calculate the noise variance from theinput. We first calculate the variances of the M components (columns).For example, for gene expression assay of 20,000 genes, we will get20,000 variances, each variance for a gene represents the variability ofthat gene among the individuals. We then choose the median of the Mvariances to be the noise variance. Our reasoning is that the majorityof the genes should have similar expression across individuals. Thedifference between individuals for those genes is due to technicalvariability and individual heterogeneity. By choosing the medianvariance, we hope that our noise setting is automatically adjusted tothe noise of the system. Formally, the noise variance is calculated asfollows:

$\begin{matrix}\left\{ \begin{matrix}{{\forall_{j}{\in {\left\lbrack {1\ldots \; M} \right\rbrack \text{:}\mspace{11mu} \sigma_{j}^{2}}}} = {{var}\left\{ {{I\left( {i,j} \right)},{i \in \left\lbrack {1\; \ldots \; N} \right\rbrack}} \right\}}} \\{\sigma^{2} = {{median}\; \left\{ {\sigma_{1}^{2},\ldots \;,\sigma_{M}^{2}} \right\}}}\end{matrix} \right. & (2)\end{matrix}$

In step (4), we generate new datasets J^((h))∈R^(N×M)(h∈[1..H]) byadding Gaussian noise to the original data as follows:

J ^((h)) =I+N(0,σ²)  (3)

where σ² is calculated in equation (2). After this step, we have Hperturbed datasets J⁽¹⁾, J⁽²⁾, . . . , J^((H)). We refer to thesedatasets as perturbed datasets because they are generated by perturbingthe original data. The perturbed datasets will be used to compute theperturbed connectivity matrices in the following section.

3. Construction of Perturbed Connectivity Matrices (Steps 5-7)

In step (5), we cluster each of the H perturbed datasets using k-meanswith varying values of k∈[2..K]. For example, for k=2, we partition thedataset J⁽¹⁾ into 2 clusters and get the Q₂ ⁽¹⁾ partitioning. We performk-means with k=2 for all the H perturbed datasets and get H differentpartitionings Q₂ ⁽¹⁾, Q₂ ⁽²⁾, . . . , Q₂ for k=2. Please note that allthese perturbed datasets were generated by adding small noise to thesame input I. In the ideal case, Q₂ ⁽¹⁾, Q₂ ⁽²⁾, . . . , Q₂ ^((H)) areall identical to P₂. The more difference between them, the less reliablethe P₂ partitioning.

After step (5), we have H different partitionings Q_(k) ⁽¹⁾, Q_(k) ⁽²⁾,. . . , Q_(k) ^((H)), for each value of k∈[2..K]. In step (6), weconstruct a connectivity matrix for each partitioning created in step(5). Specifically, for the partitioning Q_(k) ^((h))(h∈[1..H],k∈[2..K]), we construct the connectivity matrix G_(k) ^((h))∈{0,1}^(N×N)as follows:

$\begin{matrix}{{G_{k}^{(h)}\left( {i,j} \right)} = \left\{ \begin{matrix}1 & {{{if}\mspace{14mu} i},{j\mspace{14mu} {belong}\mspace{14mu} {to}\mspace{14mu} {the}\mspace{14mu} {same}\mspace{14mu} {cluster}}} \\0 & {otherwise}\end{matrix} \right.} & (4)\end{matrix}$

After this step, we have H connectivity matrices G_(k) ⁽¹⁾, G_(k) ⁽²⁾, .. . , G_(k) ^((H)) for a value of k. In the context of graph, eachconnectivity matrix can be considered as the presentation of the networkbetween patients. For a given value of k, C_(k) represents the networkfor the original data while G_(k) ^((h)) represents the network betweenpatients each time we perturb the data. The stability of the clusteringis assessed based on the discrepancy between these altered networks andthe original network. We first combine the altered networks beforecomparing the combined network to the original network. In step (7), wecalculate the perturbed connectivity matrix by averaging theconnectivity from G_(k) ⁽¹⁾, G_(k) ⁽²⁾, . . . , G_(k) ^((H)) as follows:

$\begin{matrix}{A_{k} = {\frac{1}{H}{\sum\limits_{h = 1}^{H}G_{k}^{(h)}}}} & (5)\end{matrix}$

where A_(k)∈[0,1]^(N×N), k∈[2..K]. We refer to these matrices asperturbed connectivity matrices. For each value of k∈[2..K], we have oneoriginal connectivity matrix and one perturbed connectivity matrix. Thediscrepancy between the two matrix reflects the stability of thepartitioning P_(k).

4) Stability Assessment (Step 8-10)

Given the number of cluster k, we would like to quantify the discrepancybetween the C_(k) and A_(k). We calculate the difference matrixD_(k)∈[0,1]^(N×N) as follows:

D _(k) =|C _(k) −A _(k)|  (6)

D_(k)(i,j) represents the change in connectivity between i and j whenthe data are perturbed. D_(k) consists of numbers falling into theinterval [0,1]. The distribution of the entries of D_(k) reflects thestability of the clustering. The more this distribution shifts towards1, the less robust the clustering. To quantify the discrepancy, wecompute the empirical cumulative distribution function (CDT) of the forthe entries of D_(k). In step (9) we compute the function F_(k) asfollows:

$\begin{matrix}{{F_{k}(c)} = \frac{\left\{ {{{{D_{k}\left( {i,j} \right)} \leq c}i},{j \in \left\lbrack {1\mspace{14mu} \ldots \mspace{11mu} N} \right\rbrack}} \right\} }{N^{2}}} & (7)\end{matrix}$

where the numerator represents the number of elements in D_(k) that aresmaller than or equal to c while the denominator represents the totalnumber of elements in the matrix D_(k).

In step (10), we calculate the area under the curve AUC_(k) of the CDFs.When C_(k) and A_(k) are identical (i.e., data perturbation do notchange the clustering result), the difference matrix D_(k) consists ofonly 0's. In this case, F_(k)(0)=1, and thus the area under the curveAUC_(k) to be maximized, i.e., AUC_(k)=1. If C_(k) and A_(k) differ fromeach other, then the entries of D_(k) shift towards 1, making AUC_(k)smaller than 1. The more different between C_(k) and A_(k), the smallerthe AUC_(k). The smaller the AUC_(k), the more stable the partitioning.Therefore, we choose the optimal {circumflex over (k)} for which thearea under the curve (AUC) is maximized.

$\begin{matrix}{\overset{\hat{}}{k} = {\underset{k}{argmax}\; \left( {{AUC}_{k},{k \in \left\lbrack {2\mspace{11mu} \ldots \mspace{11mu} K} \right\rbrack}} \right)}} & (8)\end{matrix}$

At the end of stage I, we return the optimal value of {circumflex over(k)}, the partitioning P_({circumflex over (k)}), the originalconnectivity matrix C_({circumflex over (k)}), and the perturbedconnectivity matrix A_({circumflex over (k)}). The connectivity matricesC_({circumflex over (k)}), A_({circumflex over (k)}) represent thenetwork between patients for one data type. These matrices will be usedto combine multi-omics data for the final clustering in stage II ofPINS.

To illustrate the workflow of the algorithm, we simulate 10 simulateddatasets named Gaussian1, Gaussian2, . . . , Gaussian10. The number ineach name is the number of classes of the dataset. Each dataset has 100samples and 1,000 genes. The samples are equally divided among theclasses. For example Gaussian2 has 2 classes of size 50 and Gaussian3has 3 classes of size 33 and 34. We will show that the AUC values arenotably low for Gaussian1 dataset, which suggests that any partitioningof this dataset is very unstable against data perturbation (FIG. 2). Forthe other 9 datasets, we will show that the partitioning is stable whenthe number of cluster equals the true number of classes (FIGS. 3, 4, 5,6, 7).

FIG. 2 shows the workflow of PINS for the simulated dataset Gaussian1.The dataset consists of 100 samples and 1, 000 genes. The expressionvalues of each gene follow the Gaussian distribution N(0,1) as shown inpanel (A). From the data, we calculate the variance of each gene. Wehave σ_(i) ²≈1, ∀i∈[1..1000], and therefore σ²≈1. We note that thevariance of the distribution has no impact on the result of PINS becausethe noise variance is set to be the median variance of the genes.

FIG. 2B shows the original connectivity matrices (upper row) andperturbed connectivity matrices (lower row). For each value of k, PINSpartitions the original data and then builds the connectivity matrix.The elements in one cluster are all connected to each other and aredisconnected to elements of other clusters. For example, when k=2, PINSforms 2 clusters of approximately equal sizes from the original data.However, when the data are perturbed, each data point randomly movesaround its original location and thus it can be grouped together withany other point with the same probability. By perturbing the data, weconstruct 200 connectivity matrices G₂ ^((h)), h∈[1..200]. The perturbedconnectivity matrix is then calculated as the average of these 200matrices:

$A_{2} = \frac{\Sigma_{h = 1}^{200}G_{2}^{(h)}}{200}$

Visually, the perturbed connectivity matrix A₂ in panel (B) shows thatdata points are randomly connected. This is also true for any othervalue of k∈[2..10]. In summary, the original connectivity greatlydisagree with the perturbed connectivity, which reflects the realstructure of the data.

FIG. 2C displays the CDFs of the entries of the difference matrices forall values of k∈[2..10]. The horizontal axis represents the entries ofthe difference matrix while the vertical axis represents the values ofthe CDFs. Panel (D) shows the area under the curve (AUC) of the CDFs.The horizontal axis shows the number of clusters and the vertical axisshows the AUC values. To understand the variability of the AUC values,we repeat the whole process 20 times with different simulated datasetshaving normally distributed gene expression. The vertical lines show the95% confidence interval of the AUCs at each value of k. The AUC valuesbarely change when the data change. We also plot the AUC values for asimulated dataset with uniformly distributed expression values. Thefigure shows that when the data are random, regardless of theirdistribution, the AUC values vary only slightly. In addition, these AUCvalues monotonically increase with k, and range from 0.5 to 0.85, whichis notably smaller than 1.

As we understand the behavior of PINS for random data, we would like toknow how PINS works on datasets that have separable classes. FIG. 3displays the workflow of PINS for the simulated dataset Gaussian2 (2classes). The dataset consists of 100 samples and 1,000 genes. Panel (A)shows the gene expression of the 2 cluster, in which each cluster has 50samples. The samples of the first cluster has the genes 1-100up-regulated while the second cluster has the genes 101-200up-regulated. These up-regulated genes are normally distributed withmean 2 and variance 1 (N (2, 1)). Other genes are normally distributedwith mean 0 and variance 1(N(0, 1)).

FIG. 3B shows the original connectivity matrices (upper row) and theperturbed connectivity matrices (lower row) of the simulated datasetGaussian2. Using the original data, the basic k-means algorithmcorrectly separate the 2 classes when k=2. As we perturb the data, eachdata point moves around its original position but still stays close toits own cluster. Therefore, samples of the same cluster are stillgrouped together, making the perturbed connectivity matrix identical tothe original connectivity matrix when k=2. When k>2, the originalconnectivity matrices show that the k-means algorithm further split thedata into smaller groups. However, when the data are perturbed, theconnectivity between data points of the same cluster, which weremistakenly separated, tend to recover. Regardless of the value of kbeing used, the perturbed connectivity matrices clearly show that thedata consists of 2 clusters, which is the true structure of the datasetGaussian2. Panel (C) shows the CDFs of the difference matrix while panel(D) shows the AUC values of the CDFs. Since the original and perturbedconnectivity matrices are identical for k=2, we have F₂(0)=1 and AUC₂=1.In other words, P₂ is the only partitioning that is stable against dataperturbation, and therefore {circumflex over (k)}=2 is the optimalnumber of clusters for the dataset Gaussian2. PINS correctly anddeterministically discovers the true classes of the dataset Gaussian2.

FIG. 4 displays the workflow of PINS for the simulated datasetGaussian3. Panel (A) shows the expression of the 3 classes. Each of thefirst and second classes have 33 samples while the third class has 34samples (totally 100 samples). The first class has the genes 1-100up-regulated; the second class has the genes 101-200 up-regulated; thethird class has the genes 201-300. These up-regulated genes are normallydistributed with mean 2 and variance 1 (N(2, 1)). Other genes arenormally distributed with mean 0 and variance 1(N(0, 1)).

FIG. 4B shows the original connectivity matrices (upper row) andperturbed connectivity matrices (lower row). For k=3, the basic k-meansalgorithm correctly separate the 3 classes using the original data. Aswe perturb the data, samples of the same class are still groupedtogether, making the perturbed connectivity matrix identical to theoriginal connectivity matrix. For k>3, the k-means algorithm furthersplits each class into smaller groups. However, when the data areperturbed, samples of the same class tend to connect to each other. Fork=2, the original connectivity matrix C₂ shows that 2 of the 3 classesare merged but the connectivity between them is not stable when the dataare perturbed. The perturbed connectivity matrices clearly suggest thatthe data consists 3 groups of samples, which is the true structure ofGaussian3.

FIG. 4C displays the empirical cumulative distribution functions (CDF)F_(k) of the difference matrix D_(k), k∈[2..10]. The horizontal axisrepresents the entries of the difference matrix while the vertical axisdisplays the values of the function (the number of elements in D_(k)smaller or equal to each entry). Panel (D) shows the area under thecurve (AUC) of the CDFs. The horizontal axis shows the number ofclusters and the vertical axis shows the AUC values. The AUC curve showsthat only the partitioning P₃ is stable against data perturbation, i.e.,{circumflex over (k)}=3. PINS correctly and deterministically discoversthe true classes of the dataset Gaussian3.

Similarly, FIGS. 5 and 6 display the workflow of PINS for the simulateddatasets Gaussian5 (5 classes) and Gaussian9 (9 classes). In both cases,the perturbed connectivity matrices clearly show the true structure ofthe data, regardless of the value of k being used. We note that forthese two datasets, the noise variance is set to the median variance ofthe genes, which can be higher than the real noise. For example withGaussian9, only that last 100 genes have variance equal to the noisevariance. The other 900 genes have variance higher than the noisevariance because there is at least one cluster having those genesup-regulated. Even in these cases, PINS still correctly identify thenumber of clusters with the optimal AUC equals to 1.

As a summary, we display the AUC values of all the 10 simulated datasetsfrom Gaussian1 to Gaussian10 in FIG. 7. The number in the name of eachsimulated dataset is the number of classes in that dataset. When thedata have no structure as in Gaussian1, the AUC values monotonicallyincrease with k, and range from 0.5 to 0.85. These AUC values vary onlyslightly regardless of the gene expression variance or distribution(FIG. 2). In addition, for any value of k, the AUC value of Gaussian1 isalways smaller than the AUC value of any other dataset that has a clearstructure. When the data consist of more than 1 class, the AUC valuesgreatly increase and reach the maximum value when the number of clusterequals to the number of classes. PINS correctly identify the optimalnumber of clusters {circumflex over (k)} withAUC_({circumflex over (k)})=1 for all these 9 simulated datasets.

FIG. 8 shows and example of the real dataset GSE19188PINS result for thereal dataset GSE19188, which consists of 91 lung cancer samples and19,851 genes. The dataset has 3 subtypes: 45 adenocarcinomas (ADC), 19large cell carcinoma (LCC), and 27 squamous cell carcinomas (SCC). Thegoal is to cluster the samples according to their subtypes using thegene expression. Panel (A) shows the connectivity matrices for k=3 andk=6. Visually, the perturbed and original connectivity matrices arealmost identical for k=3 and are greatly different for k=6. Panel (A)displays the CDF of the different matrices for k∈[2..10]. The CDF fork=3 reaches its maximum quickly, which reflects the fact that thepartitioning P₃ is the most stable among other partitionings. Panel (C)displays the AUC values, in which AUC₃ has the highest value and thus{circumflex over (k)}=3. Panel (D) displays the clustering result in thefirst two principal components. The circles represent the LCC samples;the triangles represent the ADC samples; the crosses represent the SCCsamples. PINS correctly identifies the number of the subtypes andseparate most of the samples accordingly with high Rand index (RI) andadjusted Rand index (ARI). More details about the real datasets will beexplained in the Experimental Results section below.

Before moving to a detailed discussion of how subtyping of multi-omicsdata are processed, FIG. 30 will be used to summarize what has beendescribed in detail above and also to illustrate how multi-omics datacan be introduced into the analysis. The process begins by supplying adataset, such as dataset 10, which represents a single data type (suchas mRNA data), or dataset 12, which represents a collection of pluraldatasets, each of a different data type (such as mRNA, DNA methylation,and miRNA). The dataset can be generated by collecting human tissuesamples, analyzing those samples using a laboratory genetic analyzer andstoring in a M×N array in non-transitory computer memory, where Mrepresents an element of the data type and N represents the individualperson from whom the sample was collected. Suitable datasets can beobtained from previously developed sources available commercially and/orfrom the Internet.

Whether a single dataset 10 or a multi-omics dataset 12 is being used,the dataset is processed to construct the original connectivity matrices14, as described in detail above. In parallel with the originalconnectivity matrices construction, perturbed datasets are generated asat 16. This is done, as described above by injecting a suitablyconfigured Gaussian noise into the data. The perturbed datasets are thenused to construct perturbed connectivity matrices 18.

With original and perturbed connectivity matrices now both constructed,a stability assessment is performed at 20. As a result of thisassessment, the computer-implemented algorithm identifies severalimportant data values that describe the optimal clusters for the givendatasets. These data values include, the optimal value of k, designatingthe optimal number of clusters; and the optimal partitioning of thedataset, indicating to which cluster each person's data belongs. Inaddition, the algorithm stores the original connectivity matrix and theperturbed connectivity matrix, for use in subsequent processing steps.Note that the stability assessment step defines these data values foreach dataset supplied. Thus if the dataset 10 was used, a single optimalk value and single optimal partition would be generated. If dataset 12were instead used, the stability assessment step would generate aseparate optimal k value for each data type (e.g., mRNA, DNAmethylation, and miRNA) and a separate optimal partition for each datatype as well.

Note that the process illustrated in FIG. 30 represents unclassifiedclustering. The process finds the optimal clusters without requiring anya priori knowledge of how the individual subjects may have beenclassified (if at all) prior to performing the process. Essentially, theprocess describe in FIG. 30 and above, uses the raw dataset data to findthe optimal clusters, without any requirement that a k value be selectedin advance.

B. Subtyping Multi-Omics Data

In this section, we describe the workflow of PINS for multi-omics data.Let us denote T as the number of data types. The input of PINS is a setof T matrices where ∥={I₁, I₂, . . . , I_(T)} where I_(i)∈R^(N×M) ^(i)represents the measurements of the i^(th) data type, N is the number ofpatients, and M_(i) is the number of measurements per patient for thei^(th) data type. The T matrices have the same number of rows (patients)but might have different number of columns.

Our disclosed workflow consists of two stages: i) integrate the data andcluster the patients, ii) further split each group into subgroups ifpossible. In stage I, we construct the combined similarity matrixbetween patients using the connectivity information from individual datatypes. We then combine 3 similarity-based algorithms to determine thefinal partitioning of the multi-omics data. In stage II, we furthersplit each discovered subtype if possible.

1) Stage I—Data Integration and Subtyping:

The algorithm starts by clustering each data type using the perturbationclustering (as described above). Consider the i^(th) data type with thedata matrix I_(i). The perturbation clustering estimates {circumflexover (k)}_(i) as the number of clusters and then partitions the datainto {circumflex over (k)}_(i) clusters. The algorithm returns theoriginal connectivity matrix C_(i), in which the connectivity betweenelements of the same clusters is 1 and the connectivity between elementsof different clusters is 0. Please note that the index i here denotesthe index of the data type. For T data types, we have T originalconnectivity matrices C₁; C₂; . . . ; C_(T). We combine the connectivitymatrices for the original data as follows:

$\begin{matrix}{S_{C} = \frac{\Sigma_{i = 1}^{T}C_{i}}{T}} & (9)\end{matrix}$

We refer to SC as the original similarity matrix because it isconstructed from the original connectivity matrices. If we consider eachpatient as a node, and the connectivity between two patients is an edge,then each connectivity matrix for each data type represents a graph. Ourgoal is to identify subgraphs that are strongly connected across alldata types.

We then measure the agreement between the T data types using the conceptsimilar to the pair-wise agreement of Rand index (RI). Given twopartitionings of the same set of items, the RI is calculated as thenumber of pairs that “agree”, divided by the total number of possiblepairs. A pair “agrees” if the two samples are either grouped together inboth partitionings or they are separated in both partitionings. Weextend this concept to T partitionings of T data types. First we definedthat the connectivity between two patients is consistent if it does notchange across data types. We then define the agreement of T data typesas the number of pairs having consistent connectivity, divided by thetotal number of possible pairs. In other words, the agreement betweenthe data types can be calculated as follows:

$\begin{matrix}{{{agree}\mspace{11mu} \left( S_{C} \right)} = \frac{\left\{ {{S_{C}\left( {i,j} \right)} = {{0\bigvee{S_{C}\left( {i,j} \right)}} = 1}} \right\} }{N^{2}}} & (10)\end{matrix}$

If the majority of pairs are consistent, i.e., agree(S_(C))>50%, we saythat the T data types have strong agreement. In this case, we define astrong similarity matrix Ŝ_(C) as follows:

$\begin{matrix}{{{\hat{S}}_{C}\left( {i,j} \right)} = \left\{ \begin{matrix}1 & {{{if}\mspace{14mu} {S_{C}\left( {i,j} \right)}} = 1} \\0 & {otherwise}\end{matrix} \right.} & (11)\end{matrix}$

where Ŝ_(C)(i, j)=1 if and only if i and j are clustered together in alldata types and 0 otherwise. A hierarchical clustering is then applied onthis matrix and the resulting tree is cut at the height that providesmaximum cluster separation.

When the data types do not have strong agreement, we perform a clusterensemble of 3 different methods as will be explained as follows. Thematrix {S_(C)(i, j)} represents the similarity between patients, andtherefore {1−Ŝ_(C)(i, j)} represents the pair-wise distance betweenpatients, which can be directly used by similarity-based clusteringalgorithms, such as hierarchical clustering, partitioning aroundmedoids, or dynamic tree cut. Here we use all the 3 algorithms topartition the patients and then choose the partition that agrees themost with the partitionings of individual data types. The dynamic treecut algorithm can automatically determines the number of clusters, butthe other two algorithms, hierarchical clustering (HC) and partitioningaround medoids (PAM), need us to provide the number of clusters.

To determine the number of clusters for HC and PAM, we introduce theperturbed similarity matrix, which can be calculated as follows:

$\begin{matrix}{S_{A} = \frac{\Sigma_{i = 1}^{T}A_{i}}{T}} & (12)\end{matrix}$

where A_(i) is the perturbed connectivity matrix of the i^(th) datatype. Please note that SC is constructed by averaging the originalconnectivity of T data types while S_(A) is constructed by averaging theperturbed connectivity of T data types. We use this both matrices toassess the stability of HC and PAM.

For hierarchical clustering, we first build the H₁ tree using theoriginal similarity matrix S_(C), and then we build the H₂ tree usingthe perturbed similarity matrix S_(A). For each value of k, we cut H₁ toget k clusters and then build the connectivity matrix. We do the samefor H₂ and then calculate the instability d_(k) as the sum of absolutedifference between the two connectivity matrices. We choose k for whichthe d_(k) is the smallest, i.e., {circumflex over(k)}=argmax_(k)(d_(k),k∈[2..K]).

Similarly for PAM, we partition the patients using both original andperturbed similarity matrices. For each value of k, we have onepartitioning using the original similarity matrix S_(C) and onepartitioning using the perturbed similarity matrix S_(A). We build theconnectivity matrices from the two partitioning and then calculate theinstability d_(k) as the absolute difference between the twoconnectivity matrices. We choose {circumflex over (k)} for which thed_(k) is the smallest, i.e., {circumflex over(k)}=argmax_(k)(d_(k),k∈[2..K])

After having the 3 partitionings using the 3 similarity-based clusteringalgorithms, we calculate the agreement between each partitioning and theT data types. Again, we use the agreement concept introduced in Equation(10). For each algorithm, we calculate the agreement between itspartitioning and the T partitionings for the T data types. We thenchoose the result of the algorithm that has the highest agreement withthe T data types.

2) Stage II—Splitting Groups into Subgroups

In stage II, we further split one discovered group of patients at atime, if possible. We check each group independently. If a group hasmore than ⅔ of the total samples, we run the procedure described instage I again, but this time the input consists of only the patientsbelonging to the group we are working on. The goal is to separatesamples of this group, that would not be possible with the presence ofsamples from other groups.

If a group has less than ⅔ but more than ⅓ of the total samples, we needto check the agreement between the T data types. We take intoconsideration only the samples belonging to this group. We cluster eachdata type and build the T connectivity matrices. Here we calculate theagreement between the data types using Equation (10). If the agreementis more than 50% (i.e., the majority of pairs agree across all datatypes), we further split the group. Otherwise, the group is not split.

FIG. 9 displays the subtyping of where the workflow goes through bothstages. In stage I, we first cluster each data type independently andthen build the corresponding connectivity matrices (panel 9A). We thencompute the combined similarity matrix, which is the averageconnectivity across all the data types. For this dataset, the data typeshave strong connectivity (>50% agreement) and thus we use the strongsimilarity matrix to determine the final partitioning (panel 9B). Weperform a hierarchical clustering on the strong connectivity matrix. Thestructure of the data is well defined, so using any linkage would returnthe same tree. As customary, the tree is cut where the height is themost different (dashed line in panel 9B) yielding 3 groups.

The Kaplan-Meier survival curves for these groups are shown in FIG. 9C.In stage II we check if the discovered groups can be further split intosubgroups. Group 1 has more than ⅓ of total samples and thus can beconsidered for further splitting. The connectivity matrices of samplesbelonging to group 1 also have strong agreement (>50% agreement).Therefore, this group is further split into 2 subgroups. Group 2 alsohas more than ⅓ of total samples, but the connectivity matrices do nothave strong agreement and thus this group is not split further.

The survival curves of the final partitioning is displayed in FIG. 9D.We note that although the subtype discovery was done on molecular dataalone, with no use of clinical information, the 4 groups identified havesignificantly different clinical profiles: groups 1-1 containsshort-term survival women, group 1-2 contains longer survival women,group 2 contains only men, and group 3 survival (all patients that werestill alive at the end of the study). The survival analysis indicatesthat these groups have very significantly different survival profiles(Cox p−value 1.3×10⁻⁴)

Remarkably, the significantly different groups can be obtained only whenthe 3 types of data are integrated and analyzed together. PINS cannotfind subgroups with significantly different survival for any one of thesingle data types: mRNA, methylation, and miRNA (more details in TableIV in section III-B). However, when all types of data are integrated byour approach, the p-value of the obtained subtypes becomes 4 orders ofmagnitude more significant.

The above-described computer-implemented algorithm for subtypingmulti-omics data will now be summarized with reference to FIGS. 31a and31b . FIG. 31a depicts the Stage I procedure detailed above; FIG. 31bdepicts the Stage II procedure detailed above. Referring first to FIG.31a , the Stage I procedure begins as 22. The original connectivitymatrix 23, comprising original connectivity matrices for each of theplural data types, are averaged as at 24 using Equation (9) detailedabove. This produces the original similarity matrix 25. The State Ialgorithm examines this original similarity matrix to determine if thereis strong agreement among the plural data types. While any numericmeasure of strong agreement can be used, the preferred embodimentdetermines agreement to be strong if there is agreement among more than50% among the data types. If the data shows strong agreement, ahierarchical clustering algorithm is applied to the data as at 27. Thisproduces a hierarchical cluster “tree” which is then cut as at 28 formaximum separation. The result of such clustering is then passed on toStage II.

If at step 26 there is not strong agreement, the algorithm appliesplural different clustering algorithms, as indicated diagrammatically at29. These plural algorithms are effectively run in parallel. While anumber of different clustering algorithms may be utilized, for purposesof explaining the technique, FIG. 31a illustrates three clusteringalgorithms: hierarchical clustering 30, partitioning around medoids 31and dynamic tree cut 32. The results of each of these plural clusteringalgorithms are individually examined at 33, to measure the degree ofagreement between each clustering partition and the data types.Agreement is assessed essentially the same as in step 26, using Equation(10) detailed above. The output of the clustering algorithm thatproduces the highest agreement is selected to be passed on to Stage II.

Referring now to FIG. 31b , the Stage II algorithm begins at 34. Thealgorithm is performed for each cluster group passed to it from Stage I.Thus each group is examined for size. If the size is deemed large as at35, it is sent back to Stage I at step 36 to be subdivided. While anysuitable size metric can be used, the current embodiment considers agroup to be large if it contains more than two-thirds of the totalsamples. A group is considered to be of medium size if it contains morethan one-third of the total samples, but less than two-thirds thereof.

If the group is deemed to be of medium size, at step 37, some additionalprocessing is performed. However, if the group is already below themedium size (e.g., less than one-third of the total samples, it will beretained as-is, without requiring further subdividing, as indicated at41. However if the group is of medium size, the algorithm checks at step38 to check agreement among data types. If agreement among data types ishigh (e.g., greater than 50% agreement), as detected at 39, the group iseligible to be further subdivided. Thus at step 40, the group is sentback to be further subdivided. On the other hand, if agreement amongdata types is not high the group is left un-split as at 41.

III. EXPERIMENTAL STUDIES

Our experimental studies include a wide range of cancers using a singledata type as well as using multi-omics data. To evaluate theperturbation clustering (stage I) using a single data type, we download8 gene expression datasets with known classes (subtypes) from GeneExpression Omnibus and Broad Institute websites. The performance of eachclustering method is assessed by comparing their partitionings againstthe true classes of each dataset. To evaluate the data integration(stage II) of multi-omics data, we download mRNA, methylation, and miRNAdata of 6 different cancers from The Cancer Genome Atlas (TCGA) website.The performance of the clustering methods are assessed by comparing thesurvival of the patients.

A. Experimental Studies Using Gene Expression Data

In this section we assess the performance of PINS in clustering a singledata type (stage I). Details of the 8 gene expression datasets aredescribed in Table I. The 5 datasets GSE10245, GSE19188, GSE43580,GSE15061, and GSE14924 were downloaded from Gene Expression Omnibus(http://www.ncbi.nlm.nih.gov/geo/). The 3 datasets Lung2001(http://www.broadinstitute.org/mpr/lung/), AML2004(http://www.broadinstitute.org/cancer/pub/nmf), and Brain2002(http://www.broadinstitute.org/MPR/CNS/) were downloaded from thecorresponding websites of Broad Institute.

TABLE I Description of 8 Gene Expression Datasets Used in ExperimentalStudies Class Sample Component Chip Sample Datasets Number Number NumberType Description GSE10245 2 58 19851 hgu133plus2 40 adenocarcinomas and18 squamous cell carcinomas GSE19188 3 91 19851 hgu133plus2 45adenocarcinomas, 19 large cell carcinomas, and 27 squamous cellcarcinomas GSE43580 2 150 19851 hgu133plus2 77 adenocarcinomas and 73squamous cell carcinomas GSE14924 2 20 19851 hgu133plus2 10 acutemyeloid leukemia CD4 T cell and 10 CD8 T cell AML2004 3 38 5000 hgu680011 acute myeloid leukemia, 19 acute lymphoblastic leukemia B cell, and 8T cell GSE15061 2 366 19851 hgu133plus2 202 acute myeloid leukemiasamples and 164 myelodyplastic syndrome samples Lung2001 4 237 8641hgu95a 190 adenocarcinomas, 21 squamous cell carcinomas, 20 carcinoid,and 6 small-cell lung carcinomas Brain2002 5 42 5299 hgu6800 10meduloblastomas, 10 malignant gliolas, 10 atypical teratoid/rhaboidtumors, 4 normal cerebellums, and 8 primitive neuroectodermal tumors

We compare the performance of PINS with the performance of the other 2state-of-the-art clustering algorithms, Consensus Clustering (CC), andSimilarity Network Fusion (SNF). The range of the number of clusters kis set to [2..10] for all 3 clustering algorithms. A suitable computerimplementation of Consensus Clustering is found in the R statisticalsoftware package (ConsensusClusterPlus version 1.18.0). The code for CCis run according to get the change in area under the curve Δ(k) when thenumber of clusters k increases. We choose the number of clusters{circumflex over (k)} where the CDF levels off and the correspondingΔ({circumflex over (k)}) gets close to zero, according to the classicalCC manuscript.

Regarding Similarity Network Fusion (SNF), although SNF focuses on thedata integration, it also provides an option to cluster a single datatype. The R package of SNF tool version 2.1 were downloaded fromBioconductor website. The code is run according to the description ofthe software. We calculate the number of clusters for SNF using thefunction estimateNumberOfClustersGivenGraph with the range set to[2..10]. This function returns 4 possible choices. We use the first oneas the number of clusters in this study.

For all the 8 gene expression datasets, we know the true labels(subtypes) of each sample. Therefore, we use Rand Index (RI) andadjusted Rand Index (ARI) as the metrics to assess the agreement betweenthe clustering and the ground truth (true classes of the elements).Briefly, Rand Index of 2 partitionings the number of pairs that agreedivided by the total number of pairs. In short,

${RI} = \frac{a + b}{\begin{pmatrix}N \\2\end{pmatrix}}$

where a is the number of pairs that are clustered together in bothpartitionings, b is the number of pairs that are separated in bothpartitionings, and

$\quad\begin{pmatrix}N \\2\end{pmatrix}$

is the total possible pairs from N elements. The adjusted Rand index(ARI) is the corrected-for-chance version of the Rand index (AppendixA). The clustering results are calculated using all genes withoutfiltering. However, for illustration purpose, the clustering resultswill be displayed only in the first 2 principal components.

FIG. 10 displays the clustering results of the dataset GSE10245. Thedataset consists of 58 non-small cell lung cancer samples of 2 classes:40 adenocarcinomas (ADC), and 18 squamous cell carcinomas (SCC). Fromleft to right are the results of PINS, Similarity Network Fusion (SNF),and Consensus Clustering (CC) in the first 2 principal components.Different shapes of the points represent different classes whiledifferent colors represent different clusters in the results. Visually,the classes are separable with an exception of only some samples. Usingperturbing the data, PINS recognizes that the clustering is the moststable with {circumflex over (k)}=2 and then correctly separate most ofthe samples. Using eigen-gaps, SNF correctly identifies the number ofclasses but misclassifies many ADC samples. Using sub-sampling, CCidentifies the stability at k=6 and then splits the 2 classes to smallergroups of samples. In summary, PINS achieves the best performance amongthe clustering methods (with ARI 0.08 compared to 0.38 and 0.32 of SNFand CC)

FIG. 11 displays the clustering results of the dataset GSE19188. Thedataset consists of 91 non-small cell lung cancer samples of 3 classes:45 adenocarcinomas (ADC), 19 large cell carcinomas (LCC), and 27squamous cell carcinomas (SCC). From left to right are the results ofPINS, SNF, and CC in the first 2 principal components. Different shapesof the points represent different classes while different colorsrepresent different clusters in the results. PINS correctly recognizesthat the clustering is the most stable against data perturbation when({circumflex over (k)}=3). Both SNF and CC divides the samples into 4clusters but CC has a much higher ARI than SNF. In overall, PINS has thehighest adjusted Rand index (ARI), which is 0.66 compared to 0.12 and0.6 of SNF and CC.

FIG. 12 displays the clustering results of the dataset GSE43580. Thedataset consists of 150 non-small cell lung cancer samples: 77adenocarcinomas (ADC) and 73 squamous cell carcinomas (SCC). From leftto right are the results of PINS, SNF, and CC in the first 2 principalcomponents. Different shapes of the points represent different classeswhile different colors represent different clusters in the results. BothPINS and SNF correctly identify the number of classes while CC dividesthe samples into 3 clusters. Due to the complex nature of the data, all3 clustering methods fail to separate the samples of different classes,resulted in low ARI. In overall, PINS has the highest adjusted Randindex (ARI), which is 0.44 compared to 0.15 and 0.37 of SNF and CC.

FIG. 13 displays the clustering results of the dataset GSE14924. Thedataset consists of 2 classes: 10 acute myeloid leukemia CD4 T cells and4 CD8 T cells. From left to right are the results of PINS, SNF, and CCin the first 2 principal components. Different shapes of the pointsrepresent different classes while different colors represent differentclusters in the results. PINS correctly identifies the number of classesand perfectly separate the samples with ARI=1. SNF returns an errormessage without any clustering result. CC divides the samples into 7clusters, which is much higher than the real number of classes.

FIG. 14 displays the clustering results of the dataset GSE15061. Thedataset consists of 366 leukemia samples of 2 classes: 202 acute myeloidleukemia samples and 164 myelodyplastic syndrome samples. Both PINS andSNF correctly identify the number of classes but PINS has much higherARI than that of SNF. CC divides the samples into 7 clusters, which ismuch higher than the true number of classes. PINS has the highestadjusted Rand index (ARI), which is 0.65 compared to 0.05 and 0.43 ofSNF and CC.

FIG. 15 displays the clustering results of the dataset AML2004. Thedataset consists of 38 samples of 3 classes: 11 acute myeloid leukemia(AML), 19 acute lymphoblastic leukemia B cells (ALL_Bcell), and 8 Tcells (ALL_Tcell). From left to right are the results of PINS, SNF, andCC in the first 2 principal components. Different shapes of the pointsrepresent different classes while different colors represent differentclusters in the results. In the first 2 principal components, the AMLsamples can be separated from the rest with an exception one AML samplethat is coordinated very close to ALL_Bcell samples. The samples classesALL_Tcell and ALL_Bcell stay close to each other and hard to beseparated. None of the mentioned methods discovers the number of classesin the this dataset. PINS recognizes that the clustering is the moststable with {circumflex over (k)}=4. It separates the AML and ALL_Tcellclasses accordingly, but also splits the ALL_Tcell class into 2clusters. SNF divides the dataset into 2 clusters, resulted in thelowest ARI. Similar to PINS, CC separate the AML and ALL_Tcell classeswell but also splits the ALL_Bcell into 3 clusters. In overall, PINS hasthe highest adjusted Rand index (ARI), which is 0.65 compared to 0.17and 0.56 of SNF and CC.

FIG. 16 displays the clustering results of the dataset Lung2001. Thedataset consists of 237 samples of 4 classes: 190 adenocarcinomas(ADENO), 21 squamous cell carcinomas (SQUAMOUS), 20 carcinoids(CARCINOID), and 6 small-cell lung carcinomas (SMALL_CELL). From left toright are the results of PINS, SNF, and CC in the first 2 principalcomponents. Different shapes of the points represent different classeswhile different colors represent different clusters in the results. TheCARCINOID class stands out from the rest, but the other 3 are mixedtogether and hard to separate. PINS recognizes the stability when thepartitioning has 2 clusters, one consists of the CARCINOID samples whilethe another one consists of all other samples. SNF separates theCARCINOID samples and split the rest into 2 clusters. However, each ofthese 2 clusters consists of a mixture of ADENO, SMALL_CELL, andSQUAMOUS, resulting in a lower ARI. CC splits the samples into 8clusters, which is much higher than the number of classes. In addition,each of the cluster consists of a mixture of some classes. In overall,PINS has the highest adjusted Rand index (ARI), which is 0.54 comparedto 0.28 and 0.11 of SNF and CC.

FIG. 17 displays the clustering result for the dataset Brain2002. Thedataset consists of 42 samples of 5 classes: 10 meduloblastomas(Brain_MD), 10 malignant gliolas (Brain_Mglio), 10 atypicalteratoid/rhaboid tumors (Brain_Rhab), 4 normal cerebellums (Brain_Ncer),and 8 primitive neuroectodermal tumors (Brain PNET). From left to rightare the results of PINS, SNF, and CC in the first 2 principalcomponents. Different shapes of the points represent different classeswhile different colors represent different clusters in the results. PINSrecognizes stability with {circumflex over (k)}=8, which is more thanthe number of classes. It successfully separates the most of samplesfrom classes Brain_MD, Brain_Rhab, Brain_Ncer, and Brain_Mglio butsplits the rest to many clusters. SNF divides all the samples into 2clusters for each of which is a mixture of many classes. CC discoversthe number of classes. It separates the most samples of classesBrain_Rhab and Brain_MD but fail to separate the rest of the samplesfrom the remaining 3 classes. In overall, PINS has the highest adjustedRand index (ARI), which is 0.61 compared to 0.13 and 0.46 of SNF and CC.

The summary of all results is shown in Table II. The first 3 columns inthe table show the names, sample numbers, and true class numbers of thedata sets. The next 3 columns show the number of clusters, RI, and ARIfor the clustering results of PINS. The last 6 columns show those of CCand SNF. For each dataset (row), cells highlighted in green have thehighest RI and ARI. We put NA in the result of SNF for dataset GSE14924because it returns an error message without a result. For all the 8datasets, PINS achieves higher clustering performance than SNF and CC.

TABLE II Performance of PINS, Consensus Clustering (CC), and SimilarityNetwork Fusion (SNF) Using Gene Expression Datasets Dataset PINS SNF CCName #Sample #Class #Cluster RI ARI #Cluster RI ARI #Cluster RI ARIGSE10245  58 2 2 0.90 0.80 2 0.67 0.33 6 0.64 0.32 GSE19188  91 3 3 0.840.66 2 0.58 0.16 4 0.82 0.60 GSE43580 150 2 2 0.72 0.44 2 0.57 0.14 30.68 0.37 GSE15061 366 2 2 0.83 0.65 2 0.54 0.08 6 0.72 0.43 GSE14924 20 2 2 1.00 1.00 NA NA NA 7 0.64 0.25 Lung2001 237 4 2 0.82 0.54 3 0.620.28 8 0.44 0.11 AML2004  38 3 4 0.85 0.66 2 0.59 0.17 5 0.81 0.56Brain2002  42 5 7 0.89 0.61 2 0.57 0.13 5 0.80 0.46

B. Experimental Studies Using Multi-Omics Data

1) Analysis Across a Wide Spectrum of Cancer:

We downloaded 6 different cancer datasets from The Cancer Genome Atlas(TCGA): glioblastoma multiform (GBM), lung squamous cell carcinoma(LUSC), breast invasive carcinoma (BRCA), acute myeloid leukemia (LAML),kidney renal clear cell carcinoma (KIRC), and colon adenocarcinoma(COAD). For each cancer dataset, we downloaded TCGA-curated level 3 dataof mRNA expression, DNA methylation, and mRNA expression. We analyze theset of patients that have measurements across all the 3 data types. TCGAcontains different platforms for each data type. We choose the platformsof each data type so that they have the largest set of common patients.

Table III displays details of the data for the 6 cancer datasets. Thenumber of samples is the set of patients that have measurements acrossall the 3 data types. The number of component for a data type is thenumber of measurements for a patient for that data type. The expressionvalues of DNA methylation fall between 0 and 1 and the expression valuesof microarray measurements (gene expression) fall between 2 and 14. Weuse these data as they are without processing. For sequencing data,since the values are too large (up to millions), we use their logtransformation (base 2).

TABLE III Description of the 6 Datasets Dataset # Sample Data Type #Components Platform Data Level KIRC 124 mRNA 17974 Illumina HiSeq RNASeq3 Methylation 23165 HumanMethylation27 3 miRNA 590 Illumina GASeqmiRNASeq 3 GBM 273 mRNA 12042 HT HG-U133A 3 Methylation 22833HumanMethylation27 3 miRNA 534 Illumina HiSeq miRNASeq 3 LUSC 110 mRNA12042 HT HG-U133A 3 Methylation 23348 HumanMethylation27 3 miRNA 706Illumina GASeq miRNASeq 3 BRCA 172 mRNA 20100 Illumina HiSeq RNASeqV2 3Methylation 22533 HumanMethylation27 3 miRNA 718 Illumina GASeq miRNASeq3 LAML 164 mRNA 16818 Illumina GASeq RNASeq 3 Methylation 22833HumanMethylation27 3 miRNA 552 Illumina GASeq miRNASeq 3 COAD 146 mRNA17062 Illumina GASeq RNASeq 3 Methylation 24454 HumanMethylation27 3miRNA 710 Illumina GASeq miRNASeq 3

Since the Consensus Clustering (CC) does not have the functionality tointegrate multiple data types, we compare PINS against SNF in thissection. Using the clinical data from TCGA, we calculate the Coxlog-rank test p-values for the each partitioning. We note that Coxp-values were also used to assess clustering performance for SNF. Wereport the number of clusters and Cox p-values for each data type aswell as for the integrated data in Table IV. The first 3 columnsdescribe the data while the next 4 columns show the number of clustersand Cox p-value for PINS and SNF. The results for the integrated dataare displayed in bold. The cells highlighted in green have significantp-values (cutoff 0.05). SNF gives significant p-value for only LAMLwhile PINS gives significant p-values for KIRC, GBM, LUSC, BRCA, andLAML.

For the kidney renal clear cell carcinoma (KIRC) dataset, neitheralgorithm can find groups with significantly different survival usingany single data type. SNF cannot find significant different groups evenafter data integration. In contrast, when all data types are integratedby PINS, the p-value of the obtained subtypes becomes 4 order ofmagnitude more significant.

FIG. 18 displays the Kaplan-Meier analysis for KIRC. SNF finds 2 groupswith no significantly different survival (p=0.138). In contrast, PINSdiscovers 4 different groups with very different survival profiles(p=1.3×10⁻⁴). In phase 1, PINS finds 3 groups of patients: group 1consists of 50 patients, group 2 consists of 62 patients, and group 3consists of 12 patients who all survive. In phase 2, only group 1satisfy the splitting condition and is further divided into groups 1-1(25 patients) and 1-2 (25 patients) with very different survival.Remarkably, the significantly different groups can be obtained only whenthe 3 data types of data are integrated and analyzed together. Table IVshows the results obtained by both PINS and SNG on each individual datatype, as well as by integrating the data. Neither algorithm can findsubgroups with significantly different survival for any one of thesingle data types: mRNA, methylation, and miRNA. SNF cannot findsignificant different subtypes even after data integration. However,when all types are integrated by our proposed approach, the p-value ofthe obtained subtypes becomes 4 orders of magnitude more significant.The clinical and mutation information associated with each group arereported in Experimental Studies section below.

TABLE IV Subtypes Discovered by PINS and SNF for 6 TCGA Cancer DatasetsUsing Individual Data Types as well as Integrated Data TCGA Dataset PINSSNF Name #Sample Data Type #Cluster Cox p-value #Cluster Cox p-valueKIRC 124 mRNA 2 0.176 2 0.219 Methylation 3 0.111 3 0.577 miRNA 2 0.1382 0.138 Data integration 4 1.3 × 10 ⁻⁴ 2 0.138 GBM 273 mRNA 2 0.408 20.992 Methylation 2 10-4 2 0.017 miRNA 4 0.086 2 0.401 Data integration3 8.7 × 10 ⁻⁵ 4 0.062 LUSC 110 mRNA 3 0.125 3 0.095 Methylation 8 0.0192 0.376 miRNA 2 0.117 2 0.001 Data integration 3 9.7 × 10 ⁻³ 3 0.428BRCA 172 mRNA 2 0.902 2 0.969 Methylation 4 0.048 5 0.878 miRNA 3 0.2182 0.105 Integration 7 3.4 × 10 ⁻² 2 0.398 LAML 164 mRNA 5 0.003 2 0.327Methylation 6 0.239 2 0.993 miRNA 2 0.072 3 0.183 Data integration 4 2.4× 10 ⁻³ 2 3.7 × 10 ⁻² COAD 146 mRNA 2 0.113 2 0.148 Methylation 2 0.7412 0.389 iRNA 4 0.452 3 0.131 Data integration 5 0.201 2 0.296

For glioblastoma multiform (GBM) dataset, SNF cannot find significantdifferent groups using mRNA or miRNA but it finds 2 significantdifferent groups using methylation data (p=0.017). Similarly, PINScannot find significant groups using mRNA or miRNA but finds 2significantly different groups using methylation data (p=10⁻⁴). SNFcannot find significant different groups after data integration despitehaving significantly different groups using methylation data. Incontrast, data integration by PINS finds 3 groups with even moresignificant p-value than those of individual data types (p=8.7×10⁻⁵).

FIG. 19 displays the Kaplan-Meier analysis for GBM. SNF finds 4 groupswith no significantly different survival (p=0.062). In contrast, PINSdiscovers 3 different groups with very different survival profiles(p=8.7×10⁻⁵). In phase 1, PINS finds 2 groups of patients: group 1consists of 249 patients, and group 2 consists of 24 patients. In phase2, group 1 is further divided into group 1-1 (181 patients) and 1-2 (68patients). The clinical and mutation information associated with eachgroup are reported in the Experimental Studies section below.

FIG. 20 displays the Kaplan-Meier analysis for LUSC. SNF finds 3 groupswith no significantly different survival (p=0.428). In contrast, PINSdiscovers 5 different groups with different survival profiles(p=9.7×10⁻³). None of the groups is split further in phase 2.

FIG. 21 displays the Kaplan-Meier analysis for BRCA. SNF finds 3 groupswith no significantly different survival (p=0.428). In contrast, PINSdiscovers 5 different groups with different survival profiles (p=0.034).

FIG. 22 displays the Kaplan-Meier analysis for LAML. SNF finds 3 groupswith significantly different survival (p=0.037). PINS discovers 4different groups with different survival profiles (p=2.4×10⁻³).

FIG. 23 displays the Kaplan-Meier analysis for COAD. SNF discovers 2groups while PINS discovers 4 groups. Neither algorithm can findsubgroups with significant different survival for any one of the singledata types nor with the integrated data.

C. First Case Study—Glioblastoma Multiform (GBM)

We use PINS to subtype multi-omics data for 273 patients withglioblastoma multiform (GBM). The data types are mRNA expression (HTHG-U133A), DNA methylation (HumanMethylation27), and miRNA expression(Illumina HiSeq miRNASeq) as shown in Table III. FIG. 24 shows thediscovered subtypes. The upper panel shows the 2 subtypes discovered instage I and the lower panel shows the 3 subtypes discovered in stage II.The horizontal axes represent the time pass after entry into the studywhile the vertical axes represent estimated survival percentage.

We downloaded the somatic mutation data for GBM from TCGA website. Among273 samples, only 125 samples have somatic mutation information. Here wetake into consideration a mutation (gene) if it is positive in at least5 samples. We count the number of mutations in each subtype for eachgene. We then calculate the enrichment p-value using Fisher exact testand then adjust for multiple comparison using FDR correction. Table Vdisplays the 3 mutations that are enriched after FDR correction (atcutoff 0.01). We can see that IDH1 and ATRX mutations only appear insubtype 2 and not in other subtypes.

Here we integrate three data types (mRNA, DNA methylation, miRNA) of 131patients (Table III). FIG. 24 displays the results of PINS.

We downloaded the somatic mutation data for the samples in GBM datasetfrom TCGA, and apply the same approach explained above for the samplesin GBM. The genes which are mutationally enriched in each of thesubtypes are shown in Table VI. Again, we observe that the subtypesfound for GBM data set are significantly over-represented by samplesthat have at least one mutation in the genes that are mutated frequentlyin the samples in the different survival groups.

TABLE V Somatic Mutation Information for Glioblastoma Multiform (GMB)Group 1-1 Group 1-2 Group 2 78 (181) 38 (68) 9 (24) pFisher.fdr IDH1  00 9 7 × 10⁻¹² ATRX  0 0 8 4 × 10⁻¹⁰ TP53 24 8 9 0.001

D. Second Case Study—Kidney Renal Clear Cell Carcinoma (KIRC):

Multiple integrative approaches have been applied to identify thesubtypes of kidney renal clear cell carcinoma (KIRC). Depending on thedata types, these analyses can lead to different conclusions.

Here we integrate three data types (mRNA, DNA methylation, miRNA) of 131patients (Table III). FIG. 23 displays the results of PINS.

We downloaded the somatic mutation data for the samples in KIRC datasetfrom TCGA. Using the mutation information for the samples, which havebeen subtyped using the approach described above, we identify genes thatare mutated frequently in the samples in the different survival groups.The significance of mutations for each gene in each subtype is assessedby the number of samples with at least one mutation in that gene in thatsubtype and in the whole analysis using Fisher's exact test. We applythis approach to the KIRC datasets downloaded from TCGA. The genes whichare mutationally enriched in each of the subtypes are shown in Table V.We observe that the subtypes found in the analysis are significantlyover-represented by samples that have at least one mutation in thesegenes.

We use PINS to integrate multi-omics data for 124 patients with kidneyrenal clear cell carcinoma (KIRC). The data types are mRNA expression(Illumina HiSeq), DNA methylation (HumanMethylation27), and miRNAexpression (Illumina GASeq) as shown in Table III. FIG. 27 shows thediscovered subtypes. The upper panel shows the subtypes discovered instage I and the lower panel shows the subtypes discovered in stage II.The horizontal axes represent the time pass after entry into the studywhile the vertical axes represent estimated survival percentage. Instage I, PINS discovers 3 subtypes: subtype 1 (black) consisting of 50females, subtype 2 (red) consisting of 61 males and 1 female, andsubtype 3 (green) consisting of 9 males and 3 females. The survival ratebetween the large male and female groups is comparable.

In stage II, subtype 1 is equally divided into 2 subgroups with verydifferent survival rates. Subtype 3 is not considered in stage IIbecause it consists two few samples. Subtype 2 is not divided in stageII because the 3 data types give very contradictory signals for thisgroup. In summary, PINS discovers 4 different subtypes with verydifferent survival profiles (p=1.3×10⁻⁴). The significant differentsubtypes can be obtained only when the 3 data types are integrated andanalyzed together. Although the subtype discovery was done on moleculardata alone, with no use of clinical information, the 4 groups identifiedhave significantly different clinical profiles: group 1-1 containsshort-term survival women, group 1-2 contains longer-term survivalwomen, group 2 contains only men, and group 3 contains survivors (allpatients were still alive at the end of the study).

FIG. 28 displays the heatmap of features differential among kidneyrecall clear cell carcinoma (KIRC) subtypes. The panels display theexpression of the three data types: top panel—mRNA expression, middlepanel—DNA methylation, and bottom panel—miRNA expression. The color bandon the top shows the 4 subtypes (1-1, 1-2, 2, and 3). The features areselected as follows: we cluster the data using each featureindependently with k=4 (number of subtypes using PINS). Rand index (RI)is then calculated between the resulted clustering and PINS foundsubtypes. We then order all the RI values and show those that are rankedhighest for each data type. We show top 100 features for mRNA (out of17; 974), 100 features for DNA methylation (out of 23; 165), and 30features for miRNA (out of 590). mRNA and miRNA provide a clear signalbetween subtype 3 (highest survival) and the rest. The expression valuesfor those features are either much lower (red) or much higher (green)than the rest. DNA methylation gives a clear distinction between malesand females and thus helps to separate subtype 2 from subtype 1 (theunion of 1-1 and 1-2). mRNA helps to separate subtype 1-1 from 1-2.

1) KIRC Clinical Parameters:

Enrichment for different clinical characteristics was analyzed for eachof the four survival clusters. Table VI shows the numbers andpercentages of each of the 124 patients into each of the survivalclusters and clinical categories. FDR adjusted p-values were calculatedfor phenotype enrichment in each of the clusters versus the others, andgood versus poor survivors. Using an FDR cutoff of 5%, we find thatgroup 3 (all survivors) are typically between ages 50 and 60, withnormal calcium levels, while the medium survivors are typically younger,and poor survivors tend to be over 70. Group 2 (medium survival male)tends to have low calcium and fall into grade III. Poor survivors(female group 1-1) have elevated platelets and elevated calcium, andfall into grade G4, stages III and IV. Group 1-2 (medium-good survivalfemales) are predominantly stage I with normal hemoglobin. The two purefemale clusters (1-1 and 1-2) include the majority of patients withelevated platelets. The low survivors (1-1 and 2) compared to the highsurvivors (groups 3 and 1-2) have low hemoglobin and high grade.

TABLE VI Distribution of Patients in Four Survival Clusters in EachPhenotypic Category (A) Number in each (B) % phenotype in (C) % group ineach group each group phenotype Survival 3 1-2  2 1-1  3 1-2  2 1-1  31-2  2 1-1 Group Gender Female 3 25  1 25  6 46  2 46 25 100  2 100 Male9  0 61  0 13  0 87  0 75  0 98  0 Tumor G1 0  1  1  1  0 33 33 33  0  4 2  4 Grade G2 6 15 28 12 10 25 46 20 75  60 45  48 G3 2  9 28  5  5 2064 11 25  36 45  20 G4 0  0  5  7  0  0 42 58  0  0  8  28 AJCC I 7 1731  5 12 28 52  8 58  68 50  20 pathologic II 4  3  9  2 22 17 50 11 33 12 15  8 tumor stage III 1  4 18 12  3 11 51 34  8  16 29  48 IV 0  1 4  6  0  9 36 55  0  4  6  24 Serum Low 1  9 27  9  2 20 59 20 14  6068  43 Calcium Normal 6  6 12  9 18 18 36 27 86  40 30  43 LevelElevated 0  0  1  3  0  0 25 75  0  0  3  14 Hemoglobin Low 3  8 34 18 5 13 54 29 38  38 68  75 Level Normal 5 13 16  6 13 33 40 15 63  62 32 25 Platelet Normal 7 17 42 16  9 21 51 20 88  81 88  67 Count Elevated1  3  2  8  7 21 14 57 13  14  4  33 Age <50 1  6 16  1  4 25 67  4  8 24 26  4 50-60 7  4 15  6 22 13 47 19 58  16 24  24 60-70 3 10 17  5  929 49 14 25  40 27  20 >70 1  5 14 13  3 15 42 39  8  20 23  52

TABLE VII Distribution of Patients in Two Clusters and AJCC PathologicTumor Stage AJCC Pathologic female.1 female.2 Tumor Stage (poorsurvival) (good survival) Total Stage I  5 (23%) 17 (77%) 22 Stage II  2(40%)  3 (60%) 5 Stage III 12 (75%)  4 (25%) 16 Stage IV  6 (86%)  1(14%) 7 Total 25 25

2) KIRC Female Subgroups—Clinical Parameters:

The low survival group includes 86% of the Stage IV cases, while thehigh survival group includes 77% of the Stage I cases, representing anFDR corrected p-values of less than 0.5%. Other parameters that weresignificant at FDR<5% included tumor grade (poor survivors had a higherincidence of ‘G4’ tumors, FDR 2%), tumor status (poor survivors were‘with tumor’, FDR 2%), hemoglobin level (poor survivors had low levels,FDR 2%), metastasis (2%).

3) KIRC Female Subgroups—Differential Gene Expression:

Based on absolute log-fold-change of 1.5 and maximum FDR adjustedp-value of 1%, there were 165 differentially expressed genes betweenlong- and short-term survivors. Eighty percent (132) of these weredown-regulated in the poor survivors. Functional analysis of all 165genes using iPathwayGuide points to damaged proximal tubules in thenephrons of women with poor outcome. Most common renal cell carcinomasoccur in the proximal tubules. Two KEGG pathways had FDRs on the orderof 10⁻⁴: ‘Metabolic Pathways’, and ‘Mineral Absorption’. Severaldifferentially expressed solute carriers on the Mineral AbsorptionPathway are located in ‘brush border membrane’, shown in FIG. 36. Inkidney, brush border membranes are found in the proximal tubules, whichcarry filtrate away from the glomerulus in the nephron, and support thesecretion and absorption of charged molecules into and out of thefiltrate. Gene Ontology (GO) terms analyzed at 1% FDR significancesupport a hypothesis of damage to proximal tubule membranes. Thirteensignificant Cellular Component terms were related to plasma membrane, inparticular ‘brush border membrane’. All 23 Biological Process termsconcern known proximal tubule functions: metabolic/catabolic processes,immune response, transmembrane and ionic transport. All but one(‘glucosidase activity’) of 15 Molecular Function terms involve activeionic transport across a plasma membrane. These terms included manydifferentially expressed solute carriers. The alpha-glucosidaseprecursor has been localized to the proximal tubule brush border, whereit is secreted into the urine.

TABLE VIII Gene Ontology and Disease Summary Database 33 up 132 down GOBiological Metallothionein and Metabolic processes, Process metallopeptidase activity transport GO Molecular Enzyme InhibitionTransporter activity, binding Function GO Cellular Extracellular RegionBrush Border Membrance, Component Plasma Membrane Webgestalt Cancer andRespiratory Kidney Disease Disease Tract Diseases

4) KIRC Female Subgroups—MiRNAs:

iPathway Guide also outputs a ranked list of miRNAs that areup-regulated between short and long survival women. We select the top 10miRNA and performs the t-test using miRNA expression. Two microRNAs thatwere significantly up-regulated (after FDR correction) in the lowsurvival group were among the 10 miRNAs identified by iPathwayGuide withsignificant enrichment of down-regulated target genes: hsa-mir-497 andhsa-mir-27a dysregulation. These 2 miRNAs have been observed in multiplecancers and may be up or down regulated depending on the context(mircancer.ecu.edu).

Hsa-mir-497 is a tumor suppressor reported to be involved inantiproliferation (cell cycle—G1 arrest, p53 correlation), increasedapoptosis (through WEE1), suppression of angiogenesis (through VEGFA),suppression of migration and invasion (through SMURF1), and modulationof multidrug resistance (through BCL2). Hsa-mir-497 dysregulation hasbeen found in carcinomas (prostate, bladder, colon, pancreas, breast,lung, gastric, liver, cervical, peritoneal, . . . ), and is thought toparticipate in the following biological processes/pathways: cadherin,WNT, T-cell activation, cell-cycle progression, apoptosis, PI3K/AKT, andMAPK/ERK.

Hsa-mir-27a is considered to be an ‘onco-miR’ with potential SNPinactivation. It is associated with numerous cancers: breast (familial),cervical (HeLa), glioma, AML, ALL, renal, colorectal, prostate, gastric,ovarian, pancreatic, lung, . . . and carcinomas: oral squamous cell,hepatocellular, esophageal squamous cell, gastric adenocarcinoma, . . .It has been implicated in promoting cellular proliferation, migrationand invasion and inhibiting apoptosis (through MCPH1, FOX01 and SPRY2),control of endothelial cell repulsion and angiogenesis (through SEMA6Aand VEGF), promoting metastasis by inducing epithelial to mesenchymaltransition, enhanced expression of proinflammatory cytokines, andimpairment of adipocyte differentiation and mitochondrial function. Itis thought to participate in the following biological processes andpathways: cell adhesion and cell-cell interactions, VEGF mediatedsignaling, MAPK/ERK signaling, EGFR signaling, cell cycle, NF-kbsignaling, proinflammatory cytokines, and basal transcription of thep53-273-H/mir-27a/EGFR pathway.

IV. EXEMPLARY EMBODIMENT

Referring now to FIG. 32, an exemplary implementation of a medicaltesting system utilizing the disclosed concepts will now be described.As illustrated a laboratory genetic analyzer 50 is supplied with samples52 from a plurality of human subjects and the results are stored in datastore 54 as the base dataset. Next the PINS unsupervised clusteranalysis process 56, as described herein, is performed using dataprocessor 60 and its associated non-transitory memory 62 (e.g. RAMand/or non-transitory storage) to find the optimal clusters 58 from thebase dataset in data store 54. The optimal clusters so discovered arestored as cluster data in memory 62.

Thereafter, a DNA sample from an individual patient is obtained at 74and analyzed with a benchtop analyzer 72. Essentially, the benchtopanalyzer 72 is obtaining sample data from the individual patient that isidentical to or of a similar character to the data collected from theplurality of human subjects as processed by the laboratory geneticanalyzer 50. The output from the benchtop analyzer 72 is supplied todata processor 66. Data processor has an associated non-transitorymemory 64 into which has been loaded the cluster data from memory 62,representing the previously discovered unsupervised optimal clusters.Processor 66 uses the data from the individual patient to perform asupervised classification analysis which identifies which cluster mostclosely represents the data from the individual patient, as shown at 70.

With the individual patient now assigned to the optimal cluster thatmost closely corresponds to that patient's actual genetically analyzeddata, the treating physician selects the treatment regimen that is bestsuited to that patient's needs. As noted above, without this knowledgethe conventional treatment protocol might well dictate that the patientbe given the “most popular” treatment, unaware that in this case thattreatment will not work and thus valuable life-giving time is beingwasted.

V. CONCLUSIONS

In this disclosure, we present a new approach for data integration anddisease subtyping. Our contribution is two-folds. First, we proposed anovel method to efficiently cluster high-dimensional data. This approachadds noise to the input to learn the data's behavior. The algorithm thenchooses the partitioning that is the most robust against dataperturbation. Second, we integrate multi-omics data by combining thesimilarity matrices of individual data types. Our framework looks forstrong connections across all data types to determine the number ofclusters for the final partitioning. This makes the partitioning morestable than by looking at the partitioning of each data type alone.

The advantage of the new approach is demonstrated by extensive dataanalysis. We examine 8 gene expression datasets of different diseases:lung cancer, leukemia, and brain tumors. Rand Index (RI) and AdjustedRand Index (ARI) are used as metrics to compare the performance of PINS,Consensus Clustering (CC), and Similarity Network Fusion (SNF). For allthe 8 datasets, Perturbation Clustering outperforms its competitor andalso correctly identifies the number of subtypes for most of thedatasets.

To evaluate the new approach's ability to combine multi-omics data, weexamine 6 cancers available on The Cancer Genome Atlas (TCGA):glioblastoma multiform (GBM), lung squamous cell carcinoma (LUSC),breast invasive carcinoma (BRCA), acute myeloid leukemia (LAML), kidneyrenal clear cell carcinoma (KIRC) and colon adenocarcinoma (COAD). Usingthe Cox log-rank test, we show that our framework has a clear advantageamong the competing methods.

APPENDIX

Adjusted Rand Index

We use Rand index (RI) and adjusted Rand Index (ARI) as the metrics toassess the agreement between a clustering and the ground truth (trueclasses of the elements). Rand Index of 2 partitionings is the number ofpairs that agree divided by the total number of pairs. In short,

${RI} = \frac{a + b}{\begin{pmatrix}N \\2\end{pmatrix}}$

where a is the number of pairs that are clustered together in bothpartitionings, b is the number of pairs that are separated in bothpartitionings, and

$\quad\begin{pmatrix}N \\2\end{pmatrix}$

is the total possible pairs from N elements

TABLE IX Contingency Table of Two Partitionings Class/Cluster G₁ G₂ . .. G_(g) Total C₁ n₁₁ n₁₂ . . . n_(1g) n_(1.) C₂ n₂₁ n₂₂ . . . n_(2g)n_(2.) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C_(r)n_(r1) . . . n_(rg) n_(r.) Total n_(.1) n_(.2) . . . n_(.g) n.. = n

The adjusted Rand index (ARI) is the corrected-for-chance version of theRand index. Let us consider two partitionings {C₁, C₂, . . . , C_(Γ)}and {G₁, G₂, . . . , G_(Γ)}. The agreement between these twopartitionings is summarized by the contingency Table IX. The adjustedRand index (ARI) is the corrected-for-chance version of the Rand index.ARI can be calculated as follows:

${ARI} = \frac{{\sum_{i,j}\begin{pmatrix}n_{ij} \\2\end{pmatrix}} - {\left\lbrack {\sum_{i}{\begin{pmatrix}n_{i.} \\2\end{pmatrix}{\sum_{j}\begin{pmatrix}n_{.j} \\2\end{pmatrix}}}} \right\rbrack/\begin{pmatrix}n \\2\end{pmatrix}}}{{\frac{1}{2}\left\lbrack {{\sum_{i}\begin{pmatrix}n_{i.} \\2\end{pmatrix}} + {\sum_{j}\begin{pmatrix}n_{.j} \\2\end{pmatrix}}} \right\rbrack} - {\left\lbrack {\sum_{i}{\begin{pmatrix}n_{i.} \\2\end{pmatrix}{\sum_{j}\begin{pmatrix}n_{.j} \\2\end{pmatrix}}}} \right\rbrack/\begin{pmatrix}n \\2\end{pmatrix}}}$

While RI falls into the interval [0,1], ARI can be negative. It can beshown that ARI has an expected value of 0 for two random partitionings.

1-18. (canceled)
 19. A disease subtyping/patient subgroupingcomputerized method based on the molecular profile of a patient, thecomputerized method comprising: receiving assay results of samplescollected from a population of human subjects; generating a populationdataset from the assay results that defines a molecular profile for thepopulation; receiving an assay result of a patient sample; generating apatient dataset from the assay result that defines a molecular profileof the patient; and analyzing the population dataset by: a) acquiringthe population dataset from the population of human subjects; b)performing an unsupervised cluster analysis on the population dataset todefine clusters of the population, wherein each cluster represents adisease subtype or a patient subgroup; c) using machine learningtechniques to act predetermined features of each disease subtype orpatient subgroup; and d) finding a disease subtype or patient subgroupthat represents the closest match to the molecular profile of thepatient.
 20. The computerized method according to claim 19, furthercomprising: analyzing the patient dataset and selecting a diseasetreatment regimen.
 21. The computerized method according to claim 19,wherein the population of human subjects comprises a plurality of humansubjects that are stricken with the same disease.
 22. The computerizedmethod according to claim 21, wherein the patient has the same diseaseas the plurality of human subjects and the method comprises finding thedisease subtype of the patient's disease.
 23. The computerized methodaccording to claim 22, further comprising treating the patient based onthe disease subtype.
 24. he computerized method according to claim 19,wherein the population of human subjects comprises a plurality of humansubjects that have been treated with the same drug.
 25. The methodaccording to claim 24, wherein the patient has been treated with thesame drug as the plurality of human subjects and the method comprisessubgrouping the plurality of human subjects based on responses to thedrug.
 26. The computerized method according to claim 25, furthercomprising finding the subgroup of human subjects to which the patientbelongs based on the patient's response to the drug.
 27. Thecomputerized method according to claim 19, wherein the population ofhuman subjects comprises a plurality of human subjects who have the samedisease, wherein the disease manifests differently in groups of theplurality of subjects, and the method further comprises: storing theclusters of the population in a data store comprising a non-transitorycomputer-readable memory; processing the patient dataset using acomputer-implemented classifier, the classifier ingesting the clustersin the data store and using the clusters to find which of the clustersrepresents a closest match to the patient dataset; storing the clusterfound to represent the closest match to the patient dataset as patientclassification data in a non-transitory computer-readable memory; andusing the patient classification data in the selection of a diseasetreatment regimen, wherein the unsupervised cluster analysis comprises:(a) applying a computer-implemented algorithm to the population datasetto construct a set of first connectivity matrices which are then storedin non-transitory computer-readable memory; (b) using acomputer-implemented algorithm to construct and store in non-transitorycomputer-readable memory a perturbed dataset by introducing noise intothe population dataset; (c) applying a computer-implemented algorithm tothe perturbed dataset to construct a set of perturbed connectivitymatrices which are then stored in non-transitory computer-readablememory; (d) using a computer-implemented algorithm to perform astability assessment that reads from memory and compares the firstconnectivity matrices with the perturbed connectivity matrices; (e)using a computer-implemented algorithm to select from among the firstset of connectivity matrices the one matrix whose correspondingperturbed matrix was least affected by the introduction of noise; and(f) storing the selected one matrix in non-transit computer-readablememory and using a computer-implemented algorithm to construct theclusters.
 28. The computerized method of claim 27 wherein the assayresults provide quantitative biological data selected from the groupconsisting of mRNA expression, DNA methylation, miRNA expression,protein abundance (proteomics), and metabolic concentrations(metabolomics) and genetic data.
 29. The computerized method accordingto claim 19, further comprising: using the population dataset as input,wherein the population of human subjects have the same disease, thepopulation dataset being developed by high-throughput assays selectedfrom the group consisting of mRNA expression, DNA methylation, miRNAexpression, proteomic expression, and metabolic concentration, formingthe molecular profile for the population, and performing the followingsteps: a) applying k-means to the population dataset with differentsettings of numbers of clusters, each setting resulting in oneclustering of patients; b) constructing an original connectivity matrixfor each clustering of patients; c) introducing Gaussian noise to thedataset and constructing the perturbed connectivity matrices; d)performing stability assessment between the original and perturbedconnectivity, matrices; e) selecting one original connectivity matrixthat is the least affected by noise; and f) choosing a cluster ofpatients that corresponds to the selected connectivity matrix as theoptimal subtyping.
 30. The computerized method according to claim 19,further comprising: generating the population dataset from a pluralityof types of molecular data obtained from the population of humansubjects each human subject in the population having the same disease,defining T molecular profiles for the population of human subjects andperforming the following steps: a) applying the algorithm defined in(1)-(6) below to define subtypes of the disease for each data type: (1)applying a computer-implemented algorithm to the first quantitativebiological dataset to construct a set of first connectivity matriceswhich are then stored in non-transitory computer-readable memory; (2)using a computer-implemented algorithm to construct and store innon-transitory computer-readable memory a perturbed dataset byintroducing noise into the first quantitative biological dataset; (3)applying a computer-implemented algorithm to the perturbed dataset toconstruct a set of perturbed connectivity matrices which are then storedin non-transitory computer-readable memory; (4) using acomputer-implemented algorithm to perform a stability assessment thatreads from memory and compares the first connectivity matrices with theperturbed connectivity matrices; (5) using a computer-implementedalgorithm to select from among the first set of connectivity matricesthe one matrix whose corresponding perturbed matrix is least affected bythe introduction of noise; and (6) storing the selected one matrix innon-transitory computer-readable memory and using a computer-implementedalgorithm to construct the plurality of clusters, b) constructing theconnectivity matrix for each data type; c) constructing the similaritymatrix from T connectivity matrices to gain a holistic view of theintegrated data; and d) partitioning the integrated similarity matrix todefine subtypes of the disease using the integrated data.
 31. Thecomputerized method according to claim 19, comprising: generating thepopulation dataset from the population of human subjects, each humansubject of the population having the disease; processing the populationdataset using the unsupervised cluster analysis to define the clusters;storing the clusters in a data store comprising a non-transitorycomputer-readable memory; processing the patient dataset using acomputer-implemented classifier, the classifier ingesting the clustersin the data store and using the clusters to find which of the clustersrepresents a closest match to the patient dataset; storing the clusterfound to represent the closest match as patient classification data in anon-transitory computer-readable memory; and using the patientclassification data in the selection of a disease treatment regimen,wherein the unsupervised cluster analysis comprises: (a) applying acomputer-implemented algorithm to the population dataset construct a setof first connectivity matrices which are then stored in non-transitorycomputer-readable memory; (b) using a computer-implemented algorithm toconstruct and store in non-transitory computer-readable memory aperturbed dataset by introducing noise into the population dataset; (c)applying a computer-implemented algorithm to the perturbed dataset toconstruct a set of perturbed connectivity matrices which are then storedin non-transitory computer-readable memory; (d) using acomputer-implemented algorithm to perform a stability assessment thatreads from memory and compares the first connectivity matrices with theperturbed connectivity matrices; (e) using a computer-implementedalgorithm to select from among the first set of connectivity matricesthe one matrix whose corresponding perturbed matrix is least affected bythe introduction of noise; and (f) storing the selected one matrix innon-transitory computer-readable memory and using a computer-implementedalgorithm to construct the clusters.
 32. A system comprising: at leastone processor and a memory coupled to the at least one processor;wherein the memory stores instructions for execution by the at least oneprocessor, and wherein the instructions include: receiving assay resultsof samples collected from a population of human subjects; generating apopulation dataset from the assay results that defines a molecularprofile for the population; receiving an assay result of a patientsample; generating a patient dataset from the assay result that definesa molecular profile of the patient; and analyzing the population datasetby: a) acquiring the population dataset from the population of humansubjects; b) performing an unsupervised cluster analysis on thepopulation dataset to define clusters of the population, wherein eachcluster represents a disease subtype or a patient subgroup; c) usingmachine learning techniques to extract predetermined features of eachdisease subtype or patient subgroup; and d) finding a disease subtype orpatient subgroup that represents the closest match to the molecularprofile of the patient.